Automatic Building of Synthetic Voices from Audio Books

Authors

  • Kishore Prahallad
  • Mosur Ravishankar
  • Tanja Schultz
  • Keiichi Tokuda
Abstract

Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves development of prosodically rich speech databases. However, development of such speech databases requires a large amount of effort and time. An alternative is to exploit story style monologues (long speech files) in audio books. These monologues already encapsulate rich prosody including varied intonation contours, pitch accents and phrasing patterns. Thus, audio books act as excellent candidates for building prosodic models and natural sounding synthetic voices.

The processing of such audio books poses several challenges including segmentation of long speech files, detection of mispronunciations, and extraction and evaluation of representations of prosody. In this thesis, we address the issues of segmentation of long speech files, capturing prosodic phrasing patterns of a speaker, and conversion of speaker characteristics. Techniques developed to address these issues include text-driven and speech-driven methods for segmentation of long speech files, an unsupervised algorithm for learning speaker-specific phrasing patterns, and a voice conversion method by modeling target speaker characteristics.

The major conclusions of this thesis are:

  • Audio books can be used for building synthetic voices. Segmentation of such long speech files can be accomplished without the need for a speech recognition system.
  • The prosodic phrasing patterns are specific to a speaker. These can be learnt and incorporated to improve the quality of synthetic voices.
  • Conversion of speaker characteristics can be achieved by modeling speaker-specific features of a target speaker.

Finally, the techniques developed in this thesis enable prosody research by leveraging a large number of audio books available in the public domain.


Similar resources

Handling large audio files in audio books for building synthetic voices

One of the issues in using audio books for building a synthetic voice is the segmentation of large audio files. The use of standard forced-alignment to obtain phone boundaries on large audio files fails primarily because of huge memory requirements. Earlier works have attempted to resolve this problem by using large vocabulary speech recognition system employing restricted dictionary and langua...

Automatic building of synthetic voices from large multi-paragraph speech databases

Large multi-paragraph speech databases encapsulate prosodic and contextual information beyond the sentence level which could be exploited to build natural sounding voices. This paper discusses our efforts on automatic building of synthetic voices from large multi-paragraph speech databases. We show that the primary issue of segmentation of large speech files could be addressed with modifications...

Speech recognition based confidence measures for building voices from untranscribed speech

Today, large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, news bulletins. In addition, we can effortlessly record and store audio data such as read/lecture/impromptu speech on hand-held devices. These data are rich in prosody, provide a plethora of voices to choose from, and their availability can significantly reduce the overhea...

Developing a unit selection voice given audio without corresponding text

Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, news bulletins, etc. In addition, we can effortlessly record and store audio data such as a read, lecture, or impromptu speech on handheld devices. These data are rich in prosody and provide a plethora of voices to choose from, and their availability can significantly re...

Influence of speaker familiarity on blind and visually impaired children's perception of synthetic voices in audio games

In this paper we evaluate how speaker familiarity influences the engagement times and performance of blind school children when playing audio games made with different synthetic voices. We developed synthetic voices of school children, their teachers and of speakers that were unfamiliar to them and used each of these voices to create variants of two audio games: a memory game and a labyrinth ga...


Publication date: 2010